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Abstract 



Random walks on multidimensional nonlinear landscapes are of interest in many 
i— i areas of science and engineering. In particular, properties of adaptive trajectories on 

fitness landscapes determine population fates and thus play a central role in evolu- 
tionary theory. The topography of fitness landscapes and its effect on evolutionary 
dynamics have been extensively studied in the literature. We will survey the current 
research knowledge in this field, focusing on a recently developed systematic approach 
to characterizing path lengths, mean first-passage times, and other statistics of the path 
ensemble. This approach, based on general techniques from statistical physics, is appli- 
cable to landscapes of arbitrary complexity and structure. It is especially well-suited 
to quantifying the diversity of stochastic trajectories and repeatability of evolution- 
ary events. We demonstrate this methodology using a biophysical model of protein 
(T) evolution that describes how proteins maintain stability while evolving new functions. 

in 

O 1 Introduction 

m 

Random walks on networks are ubiquitous in nature. For example, consider proteins, the 
macromolecules that carry out a myriad of chemical and mechanical functions inside a cell pQ . 
Each protein is a chain of amino acids chemically bonded to make a linear polypeptide [2], 
and the sequence of amino acids determines the protein fold — a compact 3D conformation 
which has the minimum free energy. Unlike random heteropolymers, naturally-occuring 
proteins have unique folds that they achieve robustly and, in many cases, rapidly (on the 
time scales of micro- or milliseconds) starting from arbitrary unfolded conformations [3J. 

Proteins are produced in the unfolded state inside the cell and have to fold before they 
can function. Protein conformations are often represented by sets of dihedral angles (torsion 
angles denned by three bond vectors connecting four atoms |3J). Although in general the 
values of dihedral angles are continuous (i.e., atoms can freely rotate around dihedral bonds), 
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they are typically discretized in protein structure prediction algorithms. In this case, protein 
folding can be viewed as a random walk on a network with connectivity defined by the move 
set — a set of instructions for changing the dihedral angles in each step. 

The network is very high-dimensional. For example, for a relatively small protein with 
L = 100 amino acids, 2 dihedral angles per amino acid, and 10° dihedral angle increments, 
there are 36 200 possible conformations. With a simple move set that updates one angle at a 
time, each node is connected to 35 neighbors. In such a large space, how can a protein reach 
its unique folded shape on reasonable time scales? This problem is known as the Levinthal 
paradox [3], and key to its resolution is the idea of the protein folding landscape [5, 6J. In 
our example above, each protein conformation has a free energy which is a function of 200 
dihedral angles, forming a landscape over the network. The free energy values at a node and 
its neighbors determine rates of transition between nodes, e.g., according to the Metropolis 
algorithm [?]. This landscape is believed to have a global funnel shape, which allows the 
protein to efficiently find its folded structure through incremental moves without searching 
the entire space [6]. 

This picture generalizes to many other search problems on landscapes and networks in 
which each node, corresponding to a discrete (or discretized) state of the system, can be 
assigned a value of the objective function which sets the transition rates. As with protein 
folding, a major question is how the landscape topography and the move set determine the 
dynamics. For example, an important quantity of interest is the mean first-passage time 
(to the global minimum on the protein folding landscape, for example), which should be 
minimal in optimized algorithms [8]. 

The effect of landscape topography on dynamics is of particular importance in evolution- 
ary theory, the study of how populations of organisms change over time through mutation 
and natural selection [9]. The genotype (genetic state) of an organism is represented by a 
sequence a of letters drawn from an alphabet of size k. The sequence may represent nu- 
cleotides in genomic DNA ({A, C, G,T}, k = 4), amino acids in a protein (k = 20), or the 
presence/absence of a mutation at several genes across the genome (k = 2). Assuming a fixed 
number L of sites in each sequence, the space of all k L possible sequences represents a network 
with sequence nodes connected to each other if they differ by a mutation at a single site [10] . 
For simplicity we neglect recombination between sequences and insertion/deletion of sites 
which would redefine network connectivity and in some cases the total number of nodes. 
Each sequence a can be assigned a fitness value ^(cr), defined as the survival probability of 
an individual with that sequence [TT]. The resulting construct is called a fitness landscape or, 
more precisely, a genotypic fitness landscape [12J. Just as the folding landscape's structure 
is key to a protein's ability to reach its folded state efficiently, the fitness landscape is key 
to understanding how complex biological structures, such as bacterial flagella or the human 
eye, can arise through random, incremental mutations [TO] . 

1.1 Evolutionary dynamics 

In general, individuals in a population will have different sequences, occupying a distribution 
of points on the fitness landscape. However, in the limit u <C (NlogN) -1 [T3"l IT4"] . where 
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u is the mutation rate (defined as the probability of mutation per site per generation) and 
TV is the effective population size [HI [15], new mutations arise individually and either fix in 
the population or disappear from it on time scales that are short compared with the times 
between successive mutations [T6HT8] . Thus the population is monomorphic and, apart from 
short transition periods, occupies a single node in sequence space. The stochastic process 
of a new mutant appearing and fixing in the population is known as substitution, and the 
substitution rate from sequence a to a' is given by [H] 



where Nu is the total number of new mutations per generation (Nu <C (logiV)" 1 < 1), and 
4>(s) is the probability of a single a' mutant fixing in a population of a when the selection 
coefficient is s = J-{a')/ T{a) — 1 (s > for beneficial mutations, s < for deleterious ones). 

The exact form of the fixation probability 0(s) depends on the underlying population dy- 
namics. However, it is common to consider the strong-selection weak-mutation (SSWM ) limit 
in which beneficial mutations always fix and deleterious mutations always get eliminated [T9] . 
This approximation is accurate for \s\ ^> 1 and N ^> 1. Similar to zero-temperature Monte 
Carlo, the population can only undergo substitutions that increase fitness, and all substitu- 
tions occur with the same rate Nu (since for |s| ^> 1, <p(s) ~ 1 when s > and 0(s) ~ 
when s < 0). Thus adaptation follows trajectories on the landscape along which fitness 
increases monotonically. 

Other approximations may be more appropriate in different circumstances [2D] . For 
example, when 1 <C N\s\ <C N, <f)(s) ~ s for s > and 4>(s) ~ for s < 0. Thus 
deleterious mutations always get eliminated as before, but beneficial mutations fix at the rate 
Nus. More generally, while the true dynamics of real populations may involve interference 
between multiple simultaneous mutations and other complexities [21] , simplified evolutionary 
dynamics are useful when our main objective is to understand the role of the fitness landscape 
in constraining evolution. 

1.2 Epistasis 

The most basic aspect of fitness landscape topography is known as epistasis. Let the sequence 
a be a 1 a 2 . . . a L , where is the letter at site fj, 6 {1, . . . , L}. In general the fitness function 
J-^cr) cannot be decomposed into a sum of independent contributions from each site fi. This 
means that the fitness effect of a mutation at a given site may depend on the state of other 
sites. If this is true, the sites will be correlated, which can be thought of as a coupling or 
interaction between the sites, similar to a Hamiltonian for a system of interacting particles. 

Epistasis is precisely this interactive coupling in the context of genotypic sequences. We 
categorize types of epistasis according to the qualitative differences in the fitness effects of 
mutations. We summarize the four possible cases using a two-letter, two-site model in Fig. [T] 
in which sequence AA evolves into BB, which has the highest fitness. In the first case on the 
left of Fig. [TJ there is no epistasis: the fitness effect of A — > B substitution at site 2 is the 
same regardless of the state of site 1, and vice versa. Thus the fitness can be decomposed into 
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AB AB AB AB 

Figure 1: The four qualitative types of epistasis for a two- letter, two-site model. From left 
to right: no epistasis, where each mutation has the same additive effect on fitness regardless 
of the other site, yielding a linear landscape; magnitude epistasis, where the magnitude 
(but not the sign) of the fitness effect of a mutation depends on other sites; sign epistasis, 
where the sign of a mutation's fitness effect (beneficial or deleterious) depends on other sites; 
reciprocal sign epistasis, where multiple instances of sign epistasis can lead to multiple local 
fitness maxima. 

a sum of additive contributions from each site: T(a) = ^(o" 1 ) + J-2(c" 2 ), i-e., the landscape 
is linear in sequence space. Under SSWM evolutionary dynamics, both trajectories from AA 
to the global maximum at BB are accessible. 

In the second case of Fig. [TJ the fitness effect of A — > B at site 2 differs in magnitude but 
not in sign depending on whether site 1 has A or B. This situation is known as magnitude 
epistasis [221 [23] . Magnitude epistasis does not completely block any trajectories, although 
it may affect quantitative aspects of dynamics such as adaptation times. Note that there are 
two kinds of magnitude epistasis: one in which the fitness benefit of a mutation is enhanced 
by the presence of other mutations, and the "diminishing returns" case in which the fitness 
benefit is decreased by other mutations. 

The third case of Fig. [T] shows how the A — > B substitution at site 2 can have opposite 
effects on fitness depending on the state of site 1: it is deleterious if a 1 = A, but beneficial 
if (X 1 = B. Since the sign of the fitness effect depends on the other site, the situation is 
known as sign epistasis. Sign epistasis can significantly affect accessibility of genotypes on 
the landscape by blocking trajectories: under SSWM dynamics, the trajectory AA — > AB — > 
BB is unavailable since it requires a deleterious substitution. When sign epistasis exists 
at multiple sites, it is known as reciprocal sign epistasis, as shown in the fourth case of 
Fig. [TJ Reciprocal sign epistasis is a necessary condition for the existence of multiple local 
maxima |24J. As the examples in Fig. [Tjshow, epistasis underlies landscape ruggedness that 
can constrain evolutionary trajectories. Thus the existence and nature of epistasis is of a 
prime interest in evolution. 

1.3 Measures of landscape ruggedness and accessibility 

Numerous measures have been proposed to quantify ruggedness of fitness landscapes and 
accessibility of evolutionary trajectories (summarized in Ref. [25]). One commonly-used 
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measure is the number of local fitness maxima, which is indicative of the presence and type 
of epistasis [24J: more local maxima indicates a more rugged or epistatic landscape. For 
binary alphabets, deviations of the fitness function from linearity can be quantified by fitting 
a linear function and calculating the sum of squares of residuals, known here as a roughness 
parameter (25]. A more local measure of ruggedness can be obtained by considering all pairs 
of sites and all pairs of possible letters at those sites, and then determining the sub-landscape 
for each combination like those shown in Fig. [T] Each sub-landscape can then be classified 
into one of the four epistasis types. 

Other measures consider accessibility and other properties of the trajectories themselves, 
especially those leading to the global fitness maximum. For example, without epistasis all 
trajectories to the global maximum from any other point on the landscape are accessible 
under SSWM dynamics, but with sign epistasis some trajectories become blocked. The 
distributions of trajectory times and lengths are also important for understanding the effect 
of landscape ruggedness on adaptation [261430] . 

1.4 Diversity of evolutionary trajectories and evolutionary deter- 



Landscape ruggedness is especially relevant in its effect on the predictability of evolution, 
a question of paramount importance in evolutionary theory [21]. If "life's tape" could be 
replayed, would we see a completely different outcome because evolution is a largely stochas- 
tic phenomenon, or are accessible evolutionary trajectories so constrained that the outcome 
would have been the same or recognizably similar [32]? Discussing this question in general 
entails many issues, including environmental conditions, initial conditions, and details of 
population dynamics. More specifically, one can focus on the diversity of transition path- 
ways in evolving from an ancestral state to a particular descendant state or set of states. 

One assessment of this diversity is simply counting the number of distinct accessible 
trajectories connecting an initial sequence a with a final sequence a'. For example, Weinreich 
and co-workers found that only a small fraction of all trajectories from wild-type E. coli 
to a strain resistant to antibiotics was accessible to adaptive walks [33J El]- in another 
approach, Koonin and co-workers devised a measure called mean path divergence to quantify 
the diversity more precisely [311 E5] : 



where the sum is over all pairs of distinct paths in an ensemble, p(<p) is the probability of 
path ip, and d(<fi, y^) is the distance between paths <fi and </? 2 , defined as the average of the 
shortest Hamming distances between each point cr 1 on path (pi and all points on path <p 2 , 
and vice versa: 
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where £[<p\ is the length (number of steps) of path <p, and h(ai, ip%) is the shortest Hamming 
distance between <j\ and all points oi £ (pz- The divergence therefore captures not only how 
many paths are available, but weighs them by their spatial proximity. Other measures of 
path diversity, such as path entropy and the distribution of path lengths and times [30J, are 
discussed later in this article. 

1.5 Model and empirical landscapes 

A few simple models have traditionally dominated theoretical studies of fitness landscapes 
and evolutionary trajectories. These models serve as useful null hypotheses or limits of more 
complex scenarios; they generally consider sequences with binary alphabets, in which case 
sequence space is a unit hypercube. Without attempting to account for every landscape 
proposed in the literature, we will discuss and motivate several more popular choices. In 
Kauffman's NK model [361 EZJj each of the L sites in the gene (or genes in the genome) 
interacts with K other sites chosen by random sampling. The fitness of genotype a is given 
by 

L 

H°) = E • • • ' cjnA ' (At) )' ( 4 ) 

where ni(p), . . . ,tik(p) are interaction partners of site \i. The single-site fitnesses / M are 
obtained by sampling from a continuous distribution; each combination of 2 K+1 possible 
states of the argument corresponds to an independent sampling. When K — 0, the NK 
landscape becomes fully additive and thus non-epistatic. Because in this limit the landscape 
is smooth and has a single maximum, it is sometimes called the "Mount Fuji" model [38] . The 
amount of epistasis, or landscape ruggedness, can be tuned by increasing K to the maximum 
value of L — 1. With K = L — 1, the fitnesses of different sequences are uncorrelated; 
this model is called the "House of Cards" [39] due to the unpredictable fitness effects of 
mutations. Since closely-related genotypes will realistically have correlated fitness, this limit 
serves mainly as a null model. Various properties of the NK model are known: [2"6" 1 [281, 13"6"l HU] 
for example, in the K = L — 1 limit the average number of local maxima is k L /(L(k — 1) + 1) 
for any alphabet size k [25J. 

Another class of models starts from a non-epistatic landscape and adds noise to it. For 
example, in the "rough Mount Fuji" model [38J, sequence cr is arbitrarily picked as the 
global maximum and the fitness of sequence a is given by 

T{a) = T](a) — Od(cr, er ), (5) 

where d(a, (To) is the Hamming distance between sequences a and ctq, 6 is the parameter which 
controls the slope of the smooth part of the landscape, and rj(a) is a zero-mean random 
variable sampled independently for each sequence a. The ruggedness of the landscape is 
controlled by the ratio of 9 and the standard deviation of the distribution from which the 
random variables rj(a) are sampled. 
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The landscapes described above are dominated by selection. Another approach to evo- 
lution is based on the neutral theory, which postulates that the majority of mutations have 
either no phenotypic effect (i.e., are selectively neutral) or are strongly deleterious and thus 
removed from the population [16J. This picture leads to evolution on a neutral network 
where all viable nodes have the same fitness [TOf |4"T] . With some probability, a viable indi- 
vidual can acquire a lethal mutation from which it cannot recover, and will disappear from 
the population. Thus the population as a whole can only make transitions between viable, 
selectively neutral states. Evolution on such a landscape is reminiscent of the percolation 
problem [42]: each node is assigned fitness 1 with probability p and fitness with probability 

1 — p, independent of the other nodes [29|. 

Due to the enormous number of sequences involved, the structure of fitness landscapes 
is difficult to probe experimentally. Typically, only a small number of sites is studied (4 
to 9, summarized in Ref. |25j). and at those sites, only a subset of all possible mutations 
is introduced, resulting in fitness measurements for O(10 2 )-O(10 3 ) different genotypes. In 
addition, because genotype survivability is not directly accessible in experiments, proxy 
measures of fitness are employed, such as growth rates and antibiotic resistance. Although 
these empirical studies can be used to probe the local structure of the landscapes, they are 
insufficient for analyzing the global properties of adaptive trajectories because adaptation 
may involve mutations outside of the experimentally-probed subset. 

Nevertheless, many studies have attempted to characterize empirical landscapes in terms 
of their epistatic features, accessibility, and correspondence to theoretical models [221 123 
|2"9"| 03J. For example, magnitude and sign epistasis have been observed, as well as 

significant constraints on evolutionary trajectories. One general finding of such studies is 
that empirical landscapes include some epistasis, but are far from the House of Cards regime 
in which all fitness values are completely uncorrelated [22]. The emerging picture is closer to 
the rough Mount Fuji model, which includes a limited amount of epistasis around a mostly 
linear landscape [23] . 

2 Statistical physics of stochastic paths 

Analytical treatments of evolutionary dynamics on fitness landscapes are typically restricted 
to uncorrelated or highly symmetric models, such as those previously described. Simulations, 
meanwhile, can suffer from numerical inaccuracy and be computationally expensive when 
rare events are considered. More systematic tools are necessary, especially tools that directly 
address statistical properties of stochastic paths that are relevant for understanding the 
diversity of evolutionary pathways. 

Physics and chemistry have long grappled with similar problems in the field of reac- 
tion rate theory [H], which studies rare transitions between metastable states that model 
phenomena ranging from protein folding [3 J to chemical reactions [45J. In these systems, 
quantities of interest include not only mean first-passage times and reaction rates but also 
the spatial distribution of transition paths and identification of kinetic bottlenecks. 

Transition state theory is a well-known approach to these problems; however, it relies 
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on the existence and a priori identification of key transition states [33]. A more recent 
development has been transition path sampling [35H50] , in which paths are directly sampled 
via Monte Carlo to estimate their statistical properties. Similar methods have been used in 
phylogenetic analysis of protein sequences [5TM55] . These techniques are based on a finite 
sample of paths and do not provide natural cutoffs for the size of the path ensemble, which 
may lead to noisy estimates of various path statistics. Another technique, called transition 
path theory [5EH5T)] . relies on explicit solutions to the backward equation. This approach, 
though more systematic, does not directly address the diversity of paths. 

Here we discuss a general statistical mechanics treatment of stochastic paths that provides 
many useful tools for analyzing evolutionary models and other stochastic processes [30]. A 
semi-Markov process (i.e., continuous-time random walk [HI]) on the state space S consists of 
discrete jumps between states and continuous-time waiting within states; the jump process 
is memoryless, but the waiting process need not be. Thus the process is defined by a set of 
jump probabilities, (er'|Q|cr) for the jump a — > a' (o-,a f G S) and waiting time distributions 
ip a (t), which is the PDF of waiting exactly time t in state a before jumping out. Assume 
ip a (t) has finite mean w(o~) for all a G S. For Markov processes with memoryless waiting, 
ip a (t) = e~ l l w ^ I w(o~). Non-exponential ip a (t) can also arise due to coarse-graining of Markov 
processes [62H6^] . The space S equipped with the jump matrix Q defines a network with 
directed, weighted edges. 

Define a trajectory as a path through state space ip = {cr , a±, . . . , ae} combined with a 
set of intermediate waiting times {to,ti, . . . where tj is the waiting time in <jj. We 

consider the trajectory finished once it reaches the final state a?, so we do not count the 
waiting time in that state. The probability functional of starting in the initial state o"o and 
completing the path (p no later than time t is 

fl-\ \ n-\ oo \ / e-i \ 

n[^t]=7r(«7 ) (n^iQioi)) ( n/ o dt * *t>°M\ 6 i*~em ' ( 6 ) 

where the first factor is the initial state distribution Tr(a ), the second is the product of jump 
probabilities, the third integrates over waiting times, and the fourth constrains the total 
waiting time to be less than t (0 is the Heaviside step function). In the limit t — > oo we 
obtain the probability of the path (p for any duration, 

i-i 

Hoof^] = 7r(er ) JJ^i+ilQI ^), (7) 

which is just the product of jump probabilities. In the time-dependent case, the Laplace 
transform of Eq. [6] results in a simpler expression due to deconvolution |65j: 

n[M = ^n<<7 i+1 iQK) ( g ) 

i=0 

where ^(s) is the Laplace transform of ip ai (t). For Markov processes, Eq. M becomes [66] 
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%, a ]=^n < f wlQ h > . o) 

ir J s 11 1 + sw(a) K ' 



i=0 



2.1 Path ensemble averages 

The distribution II in principle contains all statistical information on a set of paths. 
However, direct analysis of this distribution is typically prohibitive due to the high dimen- 
sionality of path space. The simplest alternative entails taking averages of various path 
properties over this distribution. Let $ be an ensemble of paths that defines some dynami- 
cal process; for example, this may be all first-passage paths from a set of initial states Si to 
a set of final states Sf. The partition function for this ensemble is 

(io) 

which represents the total probability of reaching Sf from Si by time t. 
We define the following path functionals: 



C[<p] = length of <p XM = { J f t ^J se 

i-i e-i 

T &} = ^2 W ( a i) T„[(p] = ^S^.wiai) (11) 

We can now express various path statistics as averages of these functionals over the ensemble, 
conditioned on completing the process by time t. For example, the average time of paths is 
given by 

f«(t) = (T(t)U = ^E^**]- ^ 

*v ) 

The distribution of path lengths is given by 

from which the average length £$(t) = (C(t))§ and standard deviation of length £*$(t) are 
readily obtained. 

Averages over state-dependent functionals can be used to characterize the spatial struc- 
ture of paths. For example, the fraction of time paths spend in a state a can be expressed as 
(Tcrit)) <t> / f&(t); this is a normalized distribution over states a and therefore can be considered 
as a density of states on paths in the ensemble $. In contrast, the probability that a path 
will hit a state a is given by (Z CT (£))$. We will refer to this as the density of paths in the 
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ensemble <3>. We can also express the two-point correlation function (X (T /(t)X CT (t))$, which 
gives the probability of paths passing through both a' and a. 

In many cases we are interested in the time-independent versions of these quantities, i.e., 
we focus on statistical properties of paths taking any amount of time to finish. These can 
be obtained as the t — > oo limit of the above expressions, which amounts to replacing H[(p, t] 
(Eq. [6]) with n^k/?] (Eq. [7]). We will denote these time- independent properties by simply 
omitting the time dependence, e.g., lim^oo f$(i) = f$. 

Our formalism also allows for development of path thermodynamics. The entropy of the 
path ensemble is given by 



= - (io g n(t))* + iogz*(t). 



(14) 



Indeed, if we consider the path Hamiltonian 



H[<f,t} = -\og(U[ip,t}), (15) 
we can express the path ensemble free energy as 

F (t) = (Hit)),, - S*(t) = -logZ*(t). (16) 

The partition function Z$(t) monotonically increases with time. Therefore the free energy 
F$(t) monotonically decreases as t — > oo, corresponding to equilibration of the path ensem- 
ble. 

For recurrent processes (i.e., those in which the system will almost surely reach the final 
states eventually [S3]), lim^^ Z<$, (t) = Z<$, = 1, and hence equilibrium free energy is zero. 
In these cases, equilibrium path entropy is equal to the average energy. If the ensemble $ 
consists of only a single path with nonzero probability, its entropy is S& = 0. This situation 
may arise if a landscape is so constrained that only a single viable pathway exists between 
the initial and final states. Conversely, consider a purely random walk on a homogeneous 
network with 7 nearest neighbors per node. The jump probability between any pair of 
neighboring nodes is thus q = I/7, so any path ip has probability Ilooko] = q c ^\ and the 
entropy of the ensemble is given by 



5$ = -(loglloo^ = -Ulogq. (17) 

Note that path entropy and path energy scale with the average path length, which defines a 
notion of extensivity in the path ensemble. This is sensible if we think of a path as a gas of 
particles, where each jump in the path corresponds to a particle. The path ensemble, which 
includes paths of many lengths, therefore is equivalent to the grand canonical ensemble of 
the gas. In the case of the gas, extensive quantities like entropy and energy scale with the 
number of particles, and hence these quantities here scale with the path length. In the case 
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of evolutionary dynamics on fitness landscapes, ~ k L and q ~ (L(k — 1)) 1 (where L is 
the sequence length and k is the alphabet size), yielding 

S$ ~ k L log L(k — 1). (18) 
2.2 Numerical algorithm 

The factorized form of the path probability distribution functional (Eqs.[6j[7]) permits efficient 
calculation of path ensemble averages via a recursive algorithm. Here for simplicity we 
consider the time-independent case. Let \tt) be the vector of initial state probabilities and 
\a) be the vector with 1 at position a and otherwise. For each step I and intermediate 
state a, we can recursively calculate Pe(cr) = (cr|Q^7r) and Tf(cr), the total probability and 
average time of all paths that end at a in i steps: 

P e (a') = WQWPi-iW, (19) 

nn a of <x' 



nn a of <x' 



where Pq{(t) = tt(ct), T (cr) = 0, and the sums run over all nn a of a' {a' G Sf are treated as 
absorbing states). This procedure generalizes the exact-enumeration algorithm of Ref. [67J. 
Therefore = Y,7=i T,aes f P e( a ) and 

oo 

^w = ^E p ^)' r ^ = ^EEw ( 20 ) 

* aes f * e=i aes f 

Other ensemble averages such as S$, (T a T a i)<s>, and {T a )<s> can be calculated similarly. 

Furthermore, we can calculate mean path divergence that characterizes the spatial diversity 
of the paths in $: 

oo 

V * = E E d(a,a')P e (a)P e (a'), (21) 

where d(a, a') is a distance metric on S. Our definition is distinct from that proposed in 
Refs. [2D ES] (Eq. [2J in that it dynamically calculates distances between points on paths as 
they propagate, rather than comparing the minimal distance between complete paths. Thus 
for a path that revisits some states multiple times, the divergence with a path that travels 
through the same set of states without revisiting any of them will be zero according to Eq. [2j 



but nonzero with the definition in Eq. 21 



Our algorithm allows for very general definitions of the path ensemble $ without having 
to explicitly enumerate all the paths. For instance, $ can include paths that begin and end 
at arbitrary sets of states, or are not allowed to pass through arbitrary sets of intermediate 
states. Restriction to first-passage paths is also straightforward. The time complexity of 
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Figure 2: Path ensemble statistics in a neutral network model. (A) The path length dis- 
tribution p<s>(£) (solid, blue) and exponential fit (dashed, green) in the interval [A — 5, A] 
for A = 25 for a single realization of a neutral network with p = 0.9. (B) Distribution of 
mean times of paths, (C) distribution of mean path lengths and standard deviations of 
path lengths £ s $, and (D) distribution of path entropies S$ for p = 0.1 and p = 0.9. All 
quantities in A-D are per site. Histograms in B-D are generated from 10 4 successful random 
realizations of the neutral network for each value of p; a realization is considered successful 
if both initial and final states are included in a single connected subnetwork. 



the algorithm is O^NA) (0(7 A 2 A) for £>$), where 7 is the average number of nn defined 
above, N is the number of states visited by paths in $, and A ~ is the cutoff path length. 
For simple random walks, ~ 

N d w /d f for d w > d f and ~ N for d w < df, where d w is 
the dimension of the walk and df is the fractal dimension of the space |68l [69] . Therefore, 
the algorithm scales as 0{^N 1+dw / d f) for d w > df and 0("fN 2 ) for d w < df, automatically 
accounting for the sparseness of network connections. This scaling compares favorably with 
standard linear algebra algorithms, which in general require 0(N 3 ) operations [70J to solve 
the backward equation [5B1 159"] . 
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2.3 Evolution on a neutral network 



As a simple application of our approach, we consider a population evolving on a neutral 
network [29] . In the space of all sequences of length L and with an alphabet of size k, we 
assign each sequence fitness 1 with probability p or fitness with probability 1 — p. The 
subset of fit states connected to each other forms a neutral network; there can be several 
disconnected neutral subnetworks in each landscape realization. We choose L = 8 and a 
binary alphabet {A, B} (k = 2), which gives 2 8 = 256 fit and unfit nodes in the network, 
and we consider the ensemble of first-passage paths from the sequence AAAAAAAA to the 
sequence BBBBBBBB. 

Figure [2jA. shows p$(£) for a single realization of this model with p = 0.9. The exponen- 
tial tail of pq>(£) is a universal feature of first-passage processes on finite spaces [HE]; other 
path statistics, such as the average time T&(£) of paths up to length £, also show asymptotic 
behavior that is exponential for long paths. Indeed, we can use this feature to determine the 
cutoff path length A for the algorithm: one need only consider paths with £ < A and infer 
the contributions of all longer paths from an exponential fit to the tail, which considerably 
improves the efficiency of the algorithm. This procedure takes advantage of the fact that 
information about longer paths is already contained in the structure of shorter paths; the 
maximum length A of the shorter paths that must be calculated directly depends on the 
chemical distance between the initial and final states and the lengths over which the land- 
scape is correlated. This essentially implements a numerical renormalization scheme on the 
ensemble of paths [B"B"] . 

In Fig. [2^3, C,D we show distributions of the mean path time f$, mean path length £$, 
path length standard deviation £*$, and path entropy S$ for multiple realizations of the 
neutral network with high and low values of p. We see that long paths are likely in these 
models; dozens of substitutions can occur at each site before the final state is reached. The 
larger size of the neutral network for p = 0.9 allows longer paths on average than for p = 0.1. 
However, the mean time of paths for the larger neutral network is usually smaller (Fig. [2^>); 
this is because the increased connectivity of the network leads to shorter waiting times at 
individual nodes. Larger p leads to substantially more diversity of paths and path lengths, 
as expected due to the increased size and connectivity of the network (Fig. [2p,D). Moreover, 
with p = 0.9 the neutral network is nearly the size of the entire sequence space, and thus 
q ~ L^ 1 in Eq. 17, leading to a factor of log L m 2.1 difference between and S$ (Fig. [2]C,D). 
Finally, note that the distributions of and £ s $ in Fig. J2p are nearly the same, owing to 
the nearly exponential distribution of p<s>(£) in this model (cf. Fig. [2 



3 Biophysics of protein evolution 

We now consider a more realistic example of protein evolution on a fitness landscape which 
depends on protein folding stability and energetics of intermolecular interactions (3j. Many 
recent studies have focused on how proteins evolve under the constraint of maintaining 
thermodynamic stability of their folded state [TTHT5] . Suppose that an organism encodes a 
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particular protein, and that the protein contributes fitness 1 if it is folded and fo < 1 if it is 
unfolded. Then the total fitness averaged over all proteins in an organism is given by 



HE,) = < 22 > 

where Ef is the free energy of folding (i.e., the free energy difference between folded and 
unfolded states of the protein) and = 1.7 (kcal/mol) -1 is inverse room temperature. 



Equation [22] thus states that more reliably-folding proteins confer commensurate fitness 
advantages; often it is assumed that unfolded proteins are completely nonfunctional, and so 
fo = 0. Some studies simplify this further by assuming that the folding energy Ef need only 
be below a particular threshold E^; below that threshold all proteins are adequately stable 
and equivalent in fitness. Mathematically, 

HEf) = ®{Ef-E s \ (23) 

where G is the Heaviside step function. This model is equivalent to the zero-temperature 
limit of Eq. [22] Similar approaches based on protein-DNA interaction energies have also 
been used to study evolution of gene regulation [T7], [7DHE3] ■ 

Here we consider a model that includes both protein stability and function, [30] which 
we take to be the ability to bind a target such as an enzymatic substrate or another protein. 
Let Eb be the free energy of binding relative to the chemical potential of the target molecule, 
so that the probability of binding is 1/(1 + e^ Eb ). We assume that the protein contributes 
fitness 1 if it both folds and binds, and f < 1 otherwise [EI] . Then fitness averaged over all 
proteins in an organism is given by 

Eb) = (l + e^/)(l + e^) • (24) 

The folding and binding energies depend on the amino acid sequence a. Many proteins 
have only a small number of residues at the binding interface that contribute the majority 
of the binding affinity; these residues are known as the "hotspot" [85j. We assume that 
there are L such residues and that they make additive contributions to the total binding and 
folding free energies [86] : 

L L 

E f (a) = E f ] + J2 e f^°n, E b (a) = E° b + y £e b (n,a»), (25) 
ij,=i ^=1 

where E®, E® are overall offsets and e/,&(/i, cr^) is the energy contribution of amino acid a^ 1 at 
position fi. The offset E® is a fixed contribution to the folding energy from the other residues 
in the protein, which we assume to be perfectly adapted. We sample e/s from a Gaussian 
with mean 1.25 kcal/mol and standard deviation 1.6 kcal/mol, consistent with computational 
studies showing the mutational effects on stability are universally distributed [87J. Since 
binding hotspot residues are typically defined as those having a minimum penalty of 1-3 
kcal/mol for mutations away from the wild-type amino acid [HE], we set e;,(/i,o"^ b ) = for 
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all (jl = 1, . . . , L (o"bb is the best-binding sequence: E b (a^) = E®), and we sample the other 
e^'s from an exponential distribution defined in the range of (1, oo) kcal/mol, with mean 
2 kcal/mol. The parameters of the exponential distribution are consistent with alanine- 
scanning experiments which probe energetics of amino acids at the binding interface [89J. 
By construction, E® is the binding free energy of the best-binding sequence, measured relative 
to the chemical potential of the target molecule. Here we consider L = 5 hotspot residues and 
a reduced alphabet of k = 8 amino acids grouped by physico-chemical properties, resulting 
in 8 5 = 32768 possible sequences. Different choices of these parameters can be considered, 
but they appear to have little effect on the overall qualitative features of the model. 

We consider a population of individuals whose genomes encode a protein of interest. In 
each individual, the sequence of the protein begins as perfectly adapted to binding a certain 
target molecule. Then the population is subjected to a selection pressure that favors binding 
to a new target. This situation is common in directed evolution experiments which attempt 
to evolve new protein functions in laboratory settings [90] . To model such experiments, we 
sample one set of e/s and two sets of e^'s (one for each target), while E® and E® are assumed 
to be fixed. This procedure defines two fitness landscapes, T\ and J" 2 , through Eq. 24 the 
entire population begins at the global maximum on T\ and proceeds to adapt to a new global 
or local maximum on Ti- We assume the SSWM limit as described in the Introduction; for 
Markovian waiting times, the jump probabilities are thus (er'|Q|er) = l/b(a) if J r (o J ) > jF{cf) 
and zero otherwise, where b(a) is the number of beneficial substitutions possible from a. 
Note that in this limit our results are independent of /o and the mutation rate and effective 
population size only affect the overall time scale. The path ensemble consists of all adaptive 
paths (APs); fitness monotonically increases along each path. In Fig. [3] we show three 
realizations of T% with examples of APs. 

We focus on the generic properties of these fitness landscapes, averaged over many real- 
izations of €f and 6b (Fig. |4|). Varying E® and E® reveals two qualitatively different phases 



of adaptation. When E® is low and E® is high (see Fig. |3|A. for an example), adaptation is 
in the binding phase, i.e., the need to bind the new target molecule dominates evolutionary 
dynamics. In this phase, there is typically a single local fitness maximum which coincides 
with the best-binding sequence a^b (Fig. |4jA.)- In contrast, when E® is high and E® is low 
(see Fig. [3p for an example), adaptation is in the folding phase, where evolution is mostly 
constrained by the need to maintain and increase protein stability. In this case there are also 
few local maxima and they tend to be close in sequence space to the best-folding sequence 
<7bf (Fig. |4jA). Between these phases there is a crossover regime, where folding and binding 
compete more equally in shaping the landscape and adaptation (see Fig. [3j3 for an example). 
The crossover regime has the most epistasis, as indicated by the number of local maxima, 
the accessibility of those maxima, and the fraction of fitness landscape realizations in which 
the global maximum has the largest commitment probability (Fig. |4jA.,B). The differences in 
the landscape structure in the binding and folding phases lead to substantial differences in 
adaptive dynamics. In particular, APs are longer and take more time in the binding phase 
compared to the folding phase; they are also more diverse (Fig. |4p,D). Initial and final states 
in the binding phase are separated by longer Hamming distances (Fig. pp). In the folding 
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Figure 3: Three realizations of the fitness landscape. The offsets E® and E® are different for 
each realization, but e/'s and the two sets of e^'s (one for T\ and another for T-i) are the same. 
(A) Binding phase, with E® = — 17 kcal/mol and E® = —3 kcal/mol. (B) Crossover regime, 
with Ej = —9 kcal/mol and E® = —11 kcal/mol. (C) Folding phase, with E® = —3 kcal/mol 
and E® = —17 kcal/mol. Top panels of A-C show the global distribution of all 8 5 = 32768 
sequences in energy space according to Ti, where the blue crosses indicate the best-folding 
(<Tbf) and best-binding (abb) sequences, red triangles indicate local fitness maxima on T% 
(shaded according to their commitment probabilities), and black stars indicate the initial 
state for adaptation (sequence with global maximum on Black lines are contours of 

constant fitness Ti- In the bottom panels of A-C only the region of energy space accessible 
to APs is shown; this region is outlined by dashed lines in the top panels. Example APs 
are shown in blue and green, and black circles indicate intermediate states along APs, sized 
proportional to the AP density (X (J )ap. 

phase, there is an appreciable probability that no adaptation occurs, since the initial state 
may coincide with one of the local maxima (Fig. (4^3). 

Our model reproduces several important features of molecular evolution observed in ex- 
perimental studies. First of all, adaptive dynamics may involve tradeoffs between folding 
and binding frequently observed in directed evolution experiments [901492] , even though mu- 
tational effects on folding and binding energies are uncorrelated (93]. This coupling between 
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Figure 4: Statistics of fitness landscapes and dynamics averaged over multiple landscape 
realizations. All quantities in A-D are per- residue (except number of local maxima). (A) Av- 
erage number of local fitness maxima (solid, green), the average Hamming distance Sf (num- 
ber of mutations) between the maxima and the best- folding sequence cm (dashed, blue), 
and the average distance 5b between the maxima and the best-binding sequence o"bb (dotted, 
red) for the parameter subspace E® + E® = —20 kcal/mol. Note that the average distance 
between two random sequences is 1 — 1/k — 0.875, where k = 8 is the size of the alphabet. 
(B) Fraction of local fitness maxima accessible from the initial state (dashed, blue), fraction 
of all landscape realizations in which the global maximum has the largest commitment prob- 
ability among all local maxima (dotted, red), probability that the initial sequence starts at 
a local maximum resulting in no adaptation (dashed and dotted, cyan), and fraction of all 
sequences accessible to APs (solid, green). (C) Mean AP length £ap, standard deviation 
of APs, average distance 5 between the initial state and the final states, and average length 
£ max of the longest APs connecting the initial state with the final states. (D) Path ensemble 
entropy Sap (dashed, blue) and the mean duration of paths tap (solid, green), in units of 
inverse population mutation rate (Nu)^ 1 . The probability of no adaptation in (B) is an 
average over 2 x 10 4 landscape realizations; all other data points are averages over 5 x 10 3 
realizations, and realizations with no adaptation are excluded. 
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folding and binding is introduced through nonlinearities in the fitness function J~(Ef, E^ 



(Eq. 24), which contains both magnitude and sign epistasis. We note that although our 



landscapes are generated from randomly-drawn parameters €f and e&, similar to many clas- 
sical model landscapes, these these protein landscapes are highly correlated: fitness values 
of k L sequences are determined by 2Lk parameters. Indeed, the average number of local 
maxima on a House of Cards landscape with L = 5 and k = 8 is ~ 910, while the protein 
landscape generally has no more than 7 (Fig. Thus, our models are less epistatic than 
completely uncorrelated landscapes [37]. These more moderate levels of epistasis are consis- 
tent with previous analyses of empirical fitness landscapes [221 1251 ES]- In the folding phase, 
APs tend to be short and no adaptation may occur if the old global maximum coincides 
with the new local maximum (Fig. [4)3). This lack of adaptation is sometimes observed in 
experiments in which a protein already exhibiting some affinity for the new ligand cannot 
readily increase it any further [90]. In the crossover regime, the tradeoff between binding and 
folding alone can result in proteins with marginal folding stability, in contrast with previous 
hypotheses that explain marginal stability with mutational entropy [71] or a fitness function 
that disfavors hyperstable proteins |72j . 

Our model can be extended to account for binding-mediated stability, in which binding 
stabilizes an otherwise disordered protein [91]. In this case, instead of considering folding 
and binding to be independent as in Eq. 24 , we assume that the protein can only bind when 
it is folded. The unbound protein may be biased toward the unfolded state, which effectively 
creates a free energy barrier between the "unfolded and unbound" state and the globally 
most favorable "folded and bound" state. The fitness function in this model is given by 
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1 _|_ e PE b + e f3(E f +E b ) 

Here, proteins may still have high fitness even if Ej > as long as Ef + Et> < 0, as expected 
for a protein stabilized by binding. Our methodology can be straightforwardly applied to 
this case as well, again revealing the existence of binding- and folding-dominated phases. 
See Fig. [5] for an example landscape. 

We can also incorporate chaperone-assisted folding [95] by modifying E® or the distri- 
bution of 6f's, and include "folding hotspots" away from the binding interface, which may 
acquire stabilizing mutations as a buffer against destabilizing but function-improving muta- 
tions at the interface [901 H] • Neutral and slightly deleterious mutations can be incorporated 
as well by using substitution rates from more complex population genetics models [TBI [18] , 
although we expect non-adaptive substitutions to play little role on short time scales. 

In summary, we have described a general methodology for studying stochastic paths 
on arbitrary landscapes and networks. This approach is general and can be applied to 
numerous dynamical problems in physics, chemistry, biology, and engineering, including 
protein folding, transport and search in complex media, stochastic phenotypes, and cell- 
type differentiation. In this review we have emphasized its utility in exploring evolutionary 
problems, which can often be visualized as random walks on fitness landscapes [12] and where 
the diversity and reproducibility of evolutionary paths is a central issue. We believe that 
the path-based methodology is well-suited for providing intuitive path statistics in problems 
whose complexity and high dimensionality make direct visualisations impossible. 
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